Please sign-up for an account at Posit Cloud and accept our classroom invitation here: https://posit.cloud/spaces/689711/join?access_code=8kse5IYlL4kHIqZvKaQ6mXp8IMibFayMa10I8Izn
Who are you?
Name, pronouns, group you work in
What you want to get out of the class
What has brought you joy lately?
Database Concepts, DESCRIBE, SELECT, WHERE
JOINing tables
[No class week]
Calculating new fields, GROUP BY, CASE WHEN, HAVING
Subqueries, Views, Pizza
We offer a badge of completion when you finish the course!
What it is:
What it isn’t:
Requirements:
Benefits of a DBMS:
A user interface - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language).
An execution engine - a software system that queries the data in storage. These can live on our machine, on a server within our network, or a server on the cloud.
Data Storage - the physical location where the data is stored.
| This class | Example Hutch on-site database system | Example Hutch cloud database system | |
|---|---|---|---|
| User Interface | SQL | SQL | SQL |
| Execution Engine | DuckDB | SQL Server | Databrick/Snowflake |
| Data Storage | File on our machine | FH Shared Storage | Amazon S3 Bucket |
Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables.
Person table
| person_id | year_of_birth | gender_source_value |
|---|---|---|
| 001 | 1/1/1999 | F |
| 002 | 12/31/1999 | F |
| 003 | 6/1/2000 | M |
Procedure Occurrence table
| procedure_occurrence_id | person_id | procedure_datetime |
|---|---|---|
| 101 | 001 | 4/1/2010 |
| 102 | 003 | 6/1/2022 |
| 103 | 004 | 5/1/2001 |
Entity Relationship Diagram
| name |
|---|
| care_site |
| cdm_source |
| concept |
| concept_ancestor |
| concept_class |
| concept_relationship |
| concept_synonym |
| condition_era |
| condition_occurrence |
| cost |
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |
If you look at the column_type for one of the DESCRIBE statements above, you’ll notice there are different data types:
INTEGERTIMESTAMPDATEVARCHARYou can see all of the datatypes that are available in DuckDB here.
SELECT and FROMSELECT is a clause that lets you pick out columns of interest. If you want all columns, use *.
FROM is a clause that lets you decide which table to work with.
| person_id | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | birth_datetime | race_concept_id | ethnicity_concept_id | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 8532 | 1963 | 12 | 31 | 1963-12-31 | 8516 | 0 | NA | NA | NA | 001f4a87-70d0-435c-a4b9-1425f6928d33 | F | 0 | black | 0 | west_indian | 0 |
| 123 | 8507 | 1950 | 4 | 12 | 1950-04-12 | 8527 | 0 | NA | NA | NA | 052d9254-80e8-428f-b8b6-69518b0ef3f3 | M | 0 | white | 0 | italian | 0 |
| 129 | 8507 | 1974 | 10 | 7 | 1974-10-07 | 8527 | 0 | NA | NA | NA | 054d32d5-904f-4df4-846b-8c08d165b4e9 | M | 0 | white | 0 | polish | 0 |
| 16 | 8532 | 1971 | 10 | 13 | 1971-10-13 | 8527 | 0 | NA | NA | NA | 00444703-f2c9-45c9-a247-f6317a43a929 | F | 0 | white | 0 | american | 0 |
| 65 | 8532 | 1967 | 3 | 31 | 1967-03-31 | 8516 | 0 | NA | NA | NA | 02a3dad9-f9d5-42fb-8074-c16d45b4f5c8 | F | 0 | black | 0 | dominican | 0 |
| 74 | 8532 | 1972 | 1 | 5 | 1972-01-05 | 8527 | 0 | NA | NA | NA | 02fbf1be-29b7-4da8-8bbd-14c7433f843f | F | 0 | white | 0 | english | 0 |
| 42 | 8532 | 1909 | 11 | 2 | 1909-11-02 | 8527 | 0 | NA | NA | NA | 0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8 | F | 0 | white | 0 | irish | 0 |
| 187 | 8507 | 1945 | 7 | 23 | 1945-07-23 | 8527 | 0 | NA | NA | NA | 07a1e14d-73ed-4d3a-9a39-d729745773fa | M | 0 | white | 0 | irish | 0 |
| 18 | 8532 | 1965 | 11 | 17 | 1965-11-17 | 8527 | 0 | NA | NA | NA | 0084b0fe-e30f-4930-b6d1-5e1eff4b7dea | F | 0 | white | 0 | english | 0 |
| 111 | 8532 | 1975 | 5 | 2 | 1975-05-02 | 8527 | 0 | NA | NA | NA | 0478d6b3-bdb3-4574-9b93-cf448d725b84 | F | 0 | white | 0 | english | 0 |
LIMIT n let’s you look at the first n entries.
We put multiple SQL clauses together to form a query.
Try it out yourself on procedure_occurrence table. Why is there a person_id column in this table as well?
SELECT for specific columnsInstead of * for all columns, we can specify the columns of interest:
| person_id | birth_datetime | gender_concept_id |
|---|---|---|
| 6 | 1963-12-31 | 8532 |
| 123 | 1950-04-12 | 8507 |
| 129 | 1974-10-07 | 8507 |
| 16 | 1971-10-13 | 8532 |
| 65 | 1967-03-31 | 8532 |
| 74 | 1972-01-05 | 8532 |
| 42 | 1909-11-02 | 8532 |
| 187 | 1945-07-23 | 8507 |
| 18 | 1965-11-17 | 8532 |
| 111 | 1975-05-02 | 8532 |
Try add race_concept_id and year_of_birth to your SELECT query.
WHERE - filtering our tableAdding WHERE to our SQL statement lets us add filtering to our query:
| person_id | gender_source_value | race_source_value | year_of_birth |
|---|---|---|---|
| 6 | F | black | 1963 |
| 123 | M | white | 1950 |
| 129 | M | white | 1974 |
| 16 | F | white | 1971 |
| 65 | F | black | 1967 |
| 74 | F | white | 1972 |
| 42 | F | white | 1909 |
| 187 | M | white | 1945 |
| 18 | F | white | 1965 |
| 111 | F | white | 1975 |
You don’t need to include the columns you’re filtering via WHERE in the SELECT part of the statement:
| person_id | gender_source_value | race_source_value |
|---|---|---|
| 6 | F | black |
| 123 | M | white |
| 129 | M | white |
| 16 | F | white |
| 65 | F | black |
| 74 | F | white |
| 42 | F | white |
| 187 | M | white |
| 18 | F | white |
| 111 | F | white |
WHERESingle quotes (‘M’) refer to values, and double quotes refer to columns (“person_id”).
This will trip you up several times if you’re not used to it.
| person_id | gender_source_value |
|---|---|
| 123 | M |
| 129 | M |
| 187 | M |
| 40 | M |
| 53 | M |
| 78 | M |
| 69 | M |
| 248 | M |
| 105 | M |
| 49 | M |
COUNT - how many entries?Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.
| count_star() |
|---|
| 37409 |
COUNT DISTINCT for unique entriesWhen you have repeated values, COUNT(DISTINCT ) can help you find the number of unique values in a column:
| count(DISTINCT procedure_concept_id) |
|---|
| 51 |
We can also return the actual DISTINCT values by removing COUNT:
| procedure_concept_id |
|---|
| 4058899 |
| 4295880 |
| 4216130 |
| 4024289 |
| 4202451 |
| 4330583 |
| 4238715 |
| 4186930 |
| 4242997 |
| 4047491 |
Your turn: Count the distinct values of gender_source_value in person.
DESCRIBEOne of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |
We'll see that primary keys need to be unique (so they can map to each row).
When we’re done, it’s best to close the connection with dbDisconnect().